Multilingual Augmentation for Robust Visual Question Answering in Remote Sensing Images
Aiming at answering questions based on the content of remotely sensed images,
visual question answering for remote sensing data (RSVQA) has recently attracted
much attention. However, previous works in RSVQA have focused little on the
robustness of RSVQA. As we aim to enhance the reliability of RSVQA models, how
to learn robust representations against new words and different question
templates with the same meaning is the key challenge. With the proposed
augmented dataset, we are able to obtain more questions in addition to the
original ones with the same meaning. To make better use of this information, in
this study, we propose a contrastive learning strategy for training robust
RSVQA models against diverse question templates and words. Experimental results
demonstrate that the proposed augmented dataset is effective in improving the
robustness of the RSVQA model. In addition, the contrastive learning strategy
performs well on the low resolution (LR) dataset.
Comment: This paper was submitted to the JURSE 2023 conference on November 5,
202
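The abstract above does not state which contrastive objective is used. As a rough illustration only, the sketch below assumes an InfoNCE-style loss in which each original question embedding is pulled toward the embedding of its augmented (same-meaning) counterpart and pushed away from the other questions in the batch. The function names, the temperature value, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss (assumed objective, not confirmed by the abstract).

    positives[i] is the embedding of the augmented question that has the
    same meaning as anchors[i]; every other positives[j] acts as a negative.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings: two questions and their paraphrases.
originals = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]      # paraphrases embedded consistently
mismatched = [[0.0, 1.0], [1.0, 0.0]]   # paraphrases embedded inconsistently

loss_aligned = info_nce(originals, aligned)
loss_mismatched = info_nce(originals, mismatched)
```

In this toy case the loss is lower when paraphrases share the embedding of their original question, which is the behaviour a robustness-oriented contrastive strategy would aim for.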
Change Detection Meets Visual Question Answering
The Earth's surface is continually changing, and identifying changes plays an
important role in urban planning and sustainability. Although change detection
techniques have been successfully developed for many years, these techniques
are still limited to experts and facilitators in related fields. In order to
provide every user with flexible access to change information and help them
better understand land-cover changes, we introduce a novel task: change
detection-based visual question answering (CDVQA) on multi-temporal aerial
images. In particular, multi-temporal images can be queried to obtain high
level change-based information according to content changes between two input
images. We first build a CDVQA dataset including multi-temporal
image-question-answer triplets using an automatic question-answer generation
method. Then, a baseline CDVQA framework is devised in this work, and it
contains four parts: multi-temporal feature encoding, multi-temporal fusion,
multi-modal fusion, and answer prediction. In addition, we also introduce a
change enhancing module to multi-temporal feature encoding, aiming at
incorporating more change-related information. Finally, the effects of different
backbones and multi-temporal fusion strategies on the performance of the CDVQA
task are studied. The experimental results provide useful insights for developing
better CDVQA models, which are important for future research on this task. We
will make our dataset and code publicly available.
Overcoming Language Bias in Remote Sensing Visual Question Answering via Adversarial Training
The Visual Question Answering (VQA) system offers a user-friendly interface
and enables human-computer interaction. However, VQA models commonly face the
challenge of language bias, resulting from the learned superficial correlation
between questions and answers. To address this issue, in this study, we present
a novel framework to reduce the language bias of the VQA for remote sensing
data (RSVQA). Specifically, we add an adversarial branch to the original VQA
framework. Based on the adversarial branch, we introduce two regularizers to
constrain the training process against language bias. Furthermore, to evaluate
the performance in terms of language bias, we propose a new metric that
combines standard accuracy with the performance drop when incorporating
question and random image information. Experimental results demonstrate the
effectiveness of our method. We believe that our method can shed light on
future work for reducing language bias on the RSVQA task.
RRSIS: Referring Remote Sensing Image Segmentation
Localizing desired objects from remote sensing images is of great use in
practical applications. Referring image segmentation, which aims at segmenting
out the objects to which a given expression refers, has been extensively
studied in natural images. However, this task has received almost no research
attention for remote sensing imagery. Considering its potential for real-world
applications, in this paper, we introduce referring remote sensing image
segmentation (RRSIS) to fill in this gap and make some insightful explorations.
Specifically, we create a new dataset, called RefSegRS, for this task, enabling
us to evaluate different methods. Afterward, we benchmark referring image
segmentation methods of natural images on the RefSegRS dataset and find that
these models show limited efficacy in detecting small and scattered objects. To
alleviate this issue, we propose a language-guided cross-scale enhancement
(LGCE) module that utilizes linguistic features to adaptively enhance
multi-scale visual features by integrating both deep and shallow features. The
proposed dataset, benchmarking results, and the designed LGCE module provide
insights into the design of a better RRSIS model. We will make our dataset and
code publicly available.
GETNET: A General End-to-end Two-dimensional CNN Framework for Hyperspectral Image Change Detection
Change detection (CD) is an important application of remote sensing, which
provides timely change information about the Earth's surface at large scale.
With the emergence of hyperspectral imagery, CD technology has been greatly
promoted, as hyperspectral data with high spectral resolution are capable of
detecting finer changes than traditional multispectral imagery. Nevertheless,
the high dimension of hyperspectral data makes it difficult to implement
traditional CD algorithms. Besides, endmember abundance information at subpixel
level is often not fully utilized. In order to better handle the high-dimension
problem and explore abundance information, this paper presents a General
End-to-end Two-dimensional CNN (GETNET) framework for hyperspectral image
change detection (HSI-CD). The main contributions of this work are threefold:
1) a mixed-affinity matrix that integrates subpixel representation is introduced
to mine more cross-channel gradient features and fuse multi-source information;
2) a 2-D CNN is designed to learn the discriminative features effectively from
multi-source data at a higher level and enhance the generalization ability of
the proposed CD algorithm; 3) a new HSI-CD data set is designed for the
objective comparison of different methods. Experimental results on real
hyperspectral data sets demonstrate that the proposed method outperforms most
state-of-the-art methods.
Self-Paced Curriculum Learning for Visual Question Answering on Remote Sensing Data
Answering questions with natural language by extracting information from image has great potential in various applications. Although visual question answering (VQA) for natural image has been broadly studied, VQA for remote sensing data is still in the early research stage. For the same remote sensing image, there exist questions with dramatically different difficulty-levels. Treating these questions equally may mislead the model and limit the VQA model performance. Considering this problem, in this work, we propose a self-paced curriculum learning (SPCL) based VQA model with hard and soft weighting strategies for remote sensing data. Like human learning process, the model is trained from easy to hard question samples gradually. Extensive experimental results on two datasets demonstrate that the proposed training method can achieve promising performance.
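The hard and soft weighting strategies mentioned above are not specified in the abstract. The sketch below assumes the formulations commonly used in the self-paced learning literature: a binary (hard) weight that keeps a sample only when its loss is below an age threshold, and a linear (soft) weight that decays from 1 to 0 as the loss approaches that threshold. The function names, the threshold name lam, and the example loss values are hypothetical.

```python
def hard_weights(losses, lam):
    """Binary weights: a sample is used only if its loss is below lam."""
    return [1.0 if loss < lam else 0.0 for loss in losses]


def soft_weights(losses, lam):
    """Linear soft weights: 1 at zero loss, falling to 0 once loss >= lam."""
    return [max(0.0, 1.0 - loss / lam) for loss in losses]


def weighted_loss(losses, weights):
    """Weighted mean of per-sample losses; 0.0 if no sample is selected."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total_weight


# Hypothetical per-question losses, from an easy question to a hard one.
losses = [0.2, 0.5, 1.0, 2.0]

# Raising lam over training gradually admits harder questions.
schedule = {}
for lam in (0.6, 1.2, 2.4):
    schedule[lam] = {
        "hard": hard_weights(losses, lam),
        "soft": soft_weights(losses, lam),
    }
```

With a small threshold only the easiest questions contribute to the training loss; increasing the threshold as training proceeds reproduces the easy-to-hard ordering described in the abstract.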
From Easy to Hard: Learning Language-Guided Curriculum for Visual Question Answering on Remote Sensing Data
Visual question answering (VQA) for remote sensing scenes has great potential in intelligent human–computer interaction systems. Although VQA in computer vision has been widely researched, VQA for remote sensing data (RSVQA) is still in its infancy. There are two characteristics that need to be specially considered for the RSVQA task: 1) no object annotations are available in the RSVQA datasets, which makes it difficult for models to exploit informative region representations and 2) there are questions with clearly different difficulty levels for each image in the RSVQA task. Directly training a model with questions in a random order may confuse the model and limit the performance. To address these two problems, in this article, a multi-level visual feature learning method is proposed to jointly extract language-guided holistic and regional image features. Besides, a self-paced curriculum learning (SPCL)-based VQA model is developed to train networks with samples in an easy-to-hard way. To be more specific, a language-guided SPCL method with a soft weighting strategy is explored in this work. The proposed model is evaluated on three public datasets, and extensive experimental results show that the proposed RSVQA framework can achieve promising performance. Code will be available at https://gitlab.lrz.de/ai4eo/reasoning/VQA-easy2har